The purpose of this analysis is to identify physical & chemical properties affecting white wine quality.
The dataset contains quality ratings from three wine-tasting experts together with the chemical composition of 4898 white wine samples.
The dataset was made public by:
P. Cortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis. Modeling wine preferences by data mining from physicochemical properties. In Decision Support Systems, Elsevier, 47(4):547-553. ISSN: 0167-9236.
Available at: [@Elsevier] http://dx.doi.org/10.1016/j.dss.2009.05.016 [Pre-press (pdf)] http://www3.dsi.uminho.pt/pcortez/winequality09.pdf [bib] http://www3.dsi.uminho.pt/pcortez/dss09.bib
Dataset dimensions
## [1] 4898 13
Dataset content
## 'data.frame': 4898 obs. of 13 variables:
## $ X : int 1 2 3 4 5 6 7 8 9 10 ...
## $ fixed.acidity : num 7 6.3 8.1 7.2 7.2 8.1 6.2 7 6.3 8.1 ...
## $ volatile.acidity : num 0.27 0.3 0.28 0.23 0.23 0.28 0.32 0.27 0.3 0.22 ...
## $ citric.acid : num 0.36 0.34 0.4 0.32 0.32 0.4 0.16 0.36 0.34 0.43 ...
## $ residual.sugar : num 20.7 1.6 6.9 8.5 8.5 6.9 7 20.7 1.6 1.5 ...
## $ chlorides : num 0.045 0.049 0.05 0.058 0.058 0.05 0.045 0.045 0.049 0.044 ...
## $ free.sulfur.dioxide : num 45 14 30 47 47 30 30 45 14 28 ...
## $ total.sulfur.dioxide: num 170 132 97 186 186 97 136 170 132 129 ...
## $ density : num 1.001 0.994 0.995 0.996 0.996 ...
## $ pH : num 3 3.3 3.26 3.19 3.19 3.26 3.18 3 3.3 3.22 ...
## $ sulphates : num 0.45 0.49 0.44 0.4 0.4 0.44 0.47 0.45 0.49 0.45 ...
## $ alcohol : num 8.8 9.5 10.1 9.9 9.9 10.1 9.6 8.8 9.5 11 ...
## $ quality : int 6 6 6 6 6 6 6 6 6 6 ...
The dataset includes 13 columns (1x index, 11x input variables, 1x output attribute)
We will drop index column X
New dataset dimensions
## [1] 4898 12
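Dropping the index column can be sketched as follows. This is a minimal illustration on a three-row toy data frame (an assumption standing in for the real CSV); only the column name `X` and the data frame name `ww` come from the report.

```r
# Toy stand-in for the wine data: an index column X plus two data columns.
# The three rows are illustrative only, not the full dataset.
ww <- data.frame(
  X       = 1:3,
  alcohol = c(8.8, 9.5, 10.1),
  quality = c(6L, 6L, 6L)
)

# Drop the index column; all other columns are left untouched.
ww$X <- NULL

# One column fewer than before, same number of rows.
dim(ww)
```

Assigning `NULL` removes the column in place; `subset(ww, select = -X)` would be an equivalent alternative.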
Density plot of all variables & attributes
Looking into each variable in isolation from other attributes.
summary(subset(ww, select = -c(quality)))
## fixed.acidity volatile.acidity citric.acid residual.sugar
## Min. : 3.800 Min. :0.0800 Min. :0.0000 Min. : 0.600
## 1st Qu.: 6.300 1st Qu.:0.2100 1st Qu.:0.2700 1st Qu.: 1.700
## Median : 6.800 Median :0.2600 Median :0.3200 Median : 5.200
## Mean : 6.855 Mean :0.2782 Mean :0.3342 Mean : 6.391
## 3rd Qu.: 7.300 3rd Qu.:0.3200 3rd Qu.:0.3900 3rd Qu.: 9.900
## Max. :14.200 Max. :1.1000 Max. :1.6600 Max. :65.800
## chlorides free.sulfur.dioxide total.sulfur.dioxide
## Min. :0.00900 Min. : 2.00 Min. : 9.0
## 1st Qu.:0.03600 1st Qu.: 23.00 1st Qu.:108.0
## Median :0.04300 Median : 34.00 Median :134.0
## Mean :0.04577 Mean : 35.31 Mean :138.4
## 3rd Qu.:0.05000 3rd Qu.: 46.00 3rd Qu.:167.0
## Max. :0.34600 Max. :289.00 Max. :440.0
## density pH sulphates alcohol
## Min. :0.9871 Min. :2.720 Min. :0.2200 Min. : 8.00
## 1st Qu.:0.9917 1st Qu.:3.090 1st Qu.:0.4100 1st Qu.: 9.50
## Median :0.9937 Median :3.180 Median :0.4700 Median :10.40
## Mean :0.9940 Mean :3.188 Mean :0.4898 Mean :10.51
## 3rd Qu.:0.9961 3rd Qu.:3.280 3rd Qu.:0.5500 3rd Qu.:11.40
## Max. :1.0390 Max. :3.820 Max. :1.0800 Max. :14.20
1 - fixed acidity (tartaric acid - \(g / dm^3\))
Accounts for the nonvolatile acids in wine; approximately normally distributed around 6.8 \(g/dm^3\)
2 - volatile acidity (acetic acid - \(g / dm^3\))
Accounts for the vinegar taste in wine and follows a right-skewed unimodal distribution
3 - citric acid (\(g / dm^3\))
Adds ‘freshness’ to the wine and is found in small quantities. Follows a right-skewed unimodal distribution
4 - residual sugar (\(g / dm^3\))
Right-skewed distribution with a high peak near the first quartile (1.7 \(g/dm^3\))
5 - chlorides (sodium chloride - \(g / dm^3\))
Accounts for the salty taste in wine; right-skewed, with 75% of samples at or below 0.05 \(g/dm^3\)
6 - free sulfur dioxide (\(mg / dm^3\))
Right-skewed, roughly bell-shaped distribution with a median of 34 \(mg/dm^3\)
7 - total sulfur dioxide (\(mg / dm^3\))
Right-skewed, roughly bell-shaped distribution with a median of 134 \(mg/dm^3\)
8 - density (\(g / cm^3\))
Density follows an approximately normal distribution with a mean of 0.994 \(g/cm^3\)
9 - pH (0 most acid, 7 neutral, 14 most base)
pH is a continuous scale of acidity running from 0 (most acidic) through 7 (neutral) to 14 (most basic). The analysis shows an approximately normal distribution of pH centered near 3.2 (median 3.18), with an IQR of about 0.2
10 - sulphates (potassium sulphate - \(g / dm^3\))
Sulphates are a wine additive that acts as an antimicrobial and antioxidant. The distribution is right-skewed.
11 - alcohol (% by volume)
Alcohol follows a right-skewed distribution with an IQR between 9.50% and 11.40% by volume. The shape is almost bimodal, however, with a second peak around 12.5% by volume
12 - quality (score between 0 and 10)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.000 5.000 6.000 5.878 6.000 9.000
The quality distribution appears unimodal and approximately normal, centered at 6, with most wines graded 5, 6, or 7.
Create a new categorical variable quality.grp from quality, using the following slices:
- Bad: wine quality rating less than or equal to 5
- Good: wine quality rating above 5 and up to 6 (that is, a rating of 6)
- Great: wine quality rating above 6
## Bad Good Great
## 1640 2198 1060
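The grouping above could be produced with base R's `cut()`. This is a minimal sketch on a handful of example ratings; the break points follow the slices described in the text, but the exact call used in the report is not shown, so treat it as an assumption.

```r
# A few example ratings (illustrative values, not the dataset).
quality <- c(3L, 5L, 6L, 7L, 9L)

# cut() uses right-closed intervals by default:
# (-Inf, 5] -> Bad, (5, 6] -> Good, (6, Inf) -> Great
quality.grp <- cut(
  quality,
  breaks = c(-Inf, 5, 6, Inf),
  labels = c("Bad", "Good", "Great")
)

# Count of samples per grade.
table(quality.grp)
```

Because the ratings are integers, the Good slice contains only wines rated exactly 6.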
Alcohol, pH, Total Sulfur Dioxide, & Sulphates seem to have a wide IQR relative to their overall range.
Residual Sugar, Chlorides, Density, Free Sulfur Dioxide, & Citric Acid seem to have multiple outliers (\(>2\mu\)) in comparison to the other input parameters.
A bivariate analysis is required to identify possible correlations between the different parameters.
We will start the bivariate analysis by computing the correlations between the parameters in the dataset.
We run the significance test at the 0.05 level (\(p \leq 0.05\)) and omit statistically insignificant results from the correlation graph.
The graph above shows the following significant correlations:
- Strong positive correlation between Density and Residual Sugar
- Strong negative correlation between Alcohol and Density
- Medium positive correlation between:
  - Quality and Alcohol
  - Density and Total Sulfur Dioxide
- Medium negative correlation between:
  - Quality and Density
  - Alcohol and each of Residual Sugar, Total Sulfur Dioxide, Chlorides, and Free Sulfur Dioxide
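The significance filtering described above can be sketched with base R's `cor.test()`. The helper name `significant_cor` and the example vectors are hypothetical, and the report's own plotting code is not shown, so this only illustrates the idea of blanking out correlations whose p-value exceeds 0.05.

```r
# Hypothetical helper: return Pearson's r if the correlation is significant
# at the given level, otherwise NA (so it can be left blank in a plot).
significant_cor <- function(a, b, level = 0.05) {
  test <- cor.test(a, b)  # Pearson correlation test by default
  if (test$p.value <= level) unname(test$estimate) else NA_real_
}

# Simulated, clearly correlated pair (assumption: not the wine data).
set.seed(42)
x <- rnorm(200)
y <- 0.8 * x + rnorm(200, sd = 0.5)

# Small hand-made pair whose sample correlation is zero.
u <- 1:5
v <- c(2, 1, 3, 1, 2)

significant_cor(x, y)  # kept: a numeric correlation coefficient
significant_cor(u, v)  # dropped: NA
```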
The following table highlights correlation values between Quality & other variables in the dataset.
## Attributes QualityCorrelation AbsQualityCorrelation IQR
## 1 alcohol 0.436 0.436 Q4
## 2 density -0.307 0.307 Q4
## 3 chlorides -0.210 0.210 Q4
## 4 volatile.acidity -0.195 0.195 Q3
## 5 total.sulfur.dioxide -0.175 0.175 Q3
## 6 fixed.acidity -0.114 0.114 Q2
## 7 pH 0.099 0.099 Q2
## 8 residual.sugar -0.098 0.098 Q2
## 9 sulphates 0.054 0.054 Q1
## 10 citric.acid -0.009 0.009 Q1
## 11 free.sulfur.dioxide 0.008 0.008 Q1
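A table of this shape can be built by correlating every input column with quality and sorting by absolute value. The sketch below runs on simulated data (an assumption, so the numbers will not match the table above) and leaves out the quartile-bin column.

```r
# Simulated stand-in for the wine data; column names are borrowed from the
# report, but the values are random.
set.seed(1)
n <- 200
toy <- data.frame(alcohol = rnorm(n), density = rnorm(n), pH = rnorm(n))
toy$quality <- 0.5 * toy$alcohol - 0.3 * toy$density + rnorm(n)

# Correlate each input variable with quality.
inputs <- setdiff(names(toy), "quality")
r <- sapply(inputs, function(v) cor(toy[[v]], toy$quality))

# Assemble the table and sort by absolute correlation, strongest first.
tab <- data.frame(
  Attributes            = inputs,
  QualityCorrelation    = round(r, 3),
  AbsQualityCorrelation = round(abs(r), 3),
  row.names             = NULL
)
tab <- tab[order(-tab$AbsQualityCorrelation), ]
tab
```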
Based on above correlation analysis.
Hence, we will
We will start by plotting all variables against each other, color-mapped by wine quality grade (Bad, Good, Great).
We will use a boxplot per wine grade to get a deeper understanding of the descriptive properties of the different variables.
It is clear from the above box plot that outliers in multiple variables may cloud our conclusions. Hence, we will restrict the analysis to the lowest 95% of each variable individually.
We will NOT drop any data from the dataset at this point; instead, we will limit the analysis window to the 95th percentile to eliminate outlier noise.
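Limiting the analysis window without dropping data can be sketched with `quantile()`. The vector below is a made-up example with one extreme value; how the report applies the window in its plots is not shown, so this is only one possible approach.

```r
# Made-up variable with a single extreme outlier.
x <- c(1:99, 1000)

# Upper bound of the analysis window: the 95th percentile.
upper <- quantile(x, probs = 0.95)

# Logical mask for values inside the window; x itself is not modified.
in_window <- x <= upper
windowed  <- x[in_window]

length(x)         # the original data is still complete
range(windowed)   # the outlier is outside the window
```

Keeping the mask separate from the data means each variable can get its own window while the full dataset stays available for later steps.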
In the next few sections, we will analyze Quality as the main output variable while building a deeper understanding of the inter-dependencies among the other variables.
Plotting Quality vs. other wine attributes while maintaining color-code for wine grade (Bad, Good, Great) to identify if there is any pattern associated with great wines.
It is clear from the above box diagram that great quality wines are positively associated with alcohol content. The linear model plotted in orange shows wine quality increasing roughly linearly with alcohol content.
The box plot also highlights the negative correlation between wine quality and Density, Chlorides, and Volatile Acidity, although the many outliers in those graphs make the visual association harder to see.
Hence, we will plot the distribution of the same parameters per wine grade for the lowest 95% to 98% of the dataset population, which excludes the top 2% to 5% as outliers.
Density distribution of the key quality parameters, color-coded by wine grade
We can infer from the above plot that,
A. Great quality wines have the highest median Alcohol level.
B. Great quality wines have the lowest median Density.
C. Great quality wines tend to have lower Chlorides levels.
D. Great quality wines tend to have lower Volatile Acidity levels.
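Statements such as A to D can be checked numerically by comparing per-grade medians. The sketch below uses a tiny hand-made data frame (illustrative values, not the dataset) and `tapply()`.

```r
# Hand-made example: three wines per grade with illustrative alcohol levels.
toy <- data.frame(
  alcohol = c(9.0, 9.4, 9.8, 10.0, 10.4, 10.8, 11.5, 12.0, 12.5),
  quality.grp = factor(
    rep(c("Bad", "Good", "Great"), each = 3),
    levels = c("Bad", "Good", "Great")
  )
)

# Median alcohol level within each grade.
med <- tapply(toy$alcohol, toy$quality.grp, median)
med

# Grade with the highest median alcohol level in this toy data.
names(which.max(med))
```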
Alcohol has a strong negative correlation with Density, and a weak negative correlation with Residual Sugar, Total Sulfur Dioxide, and Chlorides.
The dataset used in the following plot has been modified to remove outliers above the 95th percentile of the attributes under analysis.
Analysis shows that
Density is strongly correlated with Residual Sugar at r = 0.84
Density is moderately correlated with Total Sulfur Dioxide at r = 0.53
The dataset used in the following plot has been modified to remove outliers above the 95th percentile of the attributes under analysis.
The previous plot shows a weak correlation between Residual Sugar and Total Sulfur Dioxide, with correlation coefficient r = 0.4
With regards to Wine Quality
Great quality wines have the highest median Alcohol levels and the lowest median Density levels.
Great quality wines tend to have lower Chlorides & Volatile Acidity levels.
We have also noticed the following strong correlations in the dataset
Alcohol is strongly correlated with Density, with correlation coefficient r = -0.78.
Density is strongly correlated with Residual Sugar, with correlation coefficient r = 0.84.
The dataset used in the following plot has been modified to remove outliers above the 95th percentile of the Density attribute.
Based on the findings of the bivariate analysis, in this section we will try to understand the relationship between wine quality and its strongly correlated variables in a multivariate setting.
The above diagram suggests that, for the same Density level, great quality wines tend to have higher Alcohol content.
Let’s isolate the great-quality wines in their own graph to see if this statement still holds.
As we can see in the above plot, wines in the Great category tend to have higher ratings with increased alcohol levels at the same density range.
On the other hand, wines in the Bad category tend to have very low ratings with increased alcohol levels. It looks like there is a second-order variable in play here.
Hence, we will start analyzing Quality against those second-order variables.
We can infer from the above diagram that.
We can increase the plot contrast by dropping the middle-grade group (Good wines) from the dataset and replotting the graphs.
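Dropping the middle grade can be sketched with `subset()` followed by `droplevels()`, so that the unused Good level does not linger in plot legends. The data frame below is a toy example (an assumption), and the name `contrast` is hypothetical.

```r
# Toy ratings and the grade variable derived from them.
toy <- data.frame(quality = c(4L, 5L, 6L, 7L, 8L))
toy$quality.grp <- cut(
  toy$quality,
  breaks = c(-Inf, 5, 6, Inf),
  labels = c("Bad", "Good", "Great")
)

# Keep only the Bad and Great wines, then drop the now-empty Good level.
contrast <- droplevels(subset(toy, quality.grp != "Good"))

levels(contrast$quality.grp)
nrow(contrast)
```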
It is clear from the above plot that,
It is hard to draw conclusions from the above plot. Let’s drop the middle-grade wines (Good) from the dataset and try again.
We can infer from the above plot that,
The dataset used in the following plot has been modified to remove outliers above the 95th percentile of the Total Sulfur Dioxide & Residual Sugar attributes.
The dataset used in the following plot has also been modified to remove the Good quality subgroup for better contrast.
We can infer from the above plot that, for the same Residual Sugar level, great quality wines tend to have lower Total Sulfur Dioxide content.
We can summarize the key conclusions we gathered from the multivariate analysis into the following list.
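# mtable() comes from the memisc package
library(memisc)
# Load a fresh copy of the data
ww_lm <- read.csv('wineQualityWhites.csv')
# m1: quality explained by alcohol alone
m1 <- lm(quality ~ alcohol, data = ww_lm)
# m2: replace alcohol with density
m2 <- update(m1, ~ . - alcohol + density)
# m3: density and alcohol together
m3 <- update(m2, ~ . + alcohol)
# m4: add the remaining input variables
m4 <- update(m3, ~ . + fixed.acidity + volatile.acidity + citric.acid + residual.sugar +
    chlorides + free.sulfur.dioxide + total.sulfur.dioxide + pH + sulphates)
# Compare the four models side by side
mtable(m1, m2, m3, m4, sdigits = 3)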
##
## Calls:
## m1: lm(formula = quality ~ alcohol, data = ww_lm)
## m2: lm(formula = quality ~ density, data = ww_lm)
## m3: lm(formula = quality ~ density + alcohol, data = ww_lm)
## m4: lm(formula = quality ~ density + alcohol + fixed.acidity + volatile.acidity +
## citric.acid + residual.sugar + chlorides + free.sulfur.dioxide +
## total.sulfur.dioxide + pH + sulphates, data = ww_lm)
##
## ================================================================================
## m1 m2 m3 m4
## --------------------------------------------------------------------------------
## (Intercept) 2.582*** 96.277*** -22.492*** 150.193***
## (0.098) (4.003) (6.165) (18.804)
## alcohol 0.313*** 0.360*** 0.193***
## (0.009) (0.015) (0.024)
## density -90.942*** 24.728*** -150.284***
## (4.027) (6.079) (19.075)
## fixed.acidity 0.066**
## (0.021)
## volatile.acidity -1.863***
## (0.114)
## citric.acid 0.022
## (0.096)
## residual.sugar 0.081***
## (0.008)
## chlorides -0.247
## (0.547)
## free.sulfur.dioxide 0.004***
## (0.001)
## total.sulfur.dioxide -0.000
## (0.000)
## pH 0.686***
## (0.105)
## sulphates 0.631***
## (0.100)
## --------------------------------------------------------------------------------
## R-squared 0.190 0.094 0.192 0.282
## adj. R-squared 0.190 0.094 0.192 0.280
## sigma 0.797 0.843 0.796 0.751
## F 1146.395 509.911 583.290 174.344
## p 0.000 0.000 0.000 0.000
## Log-likelihood -5839.391 -6111.983 -5831.127 -5543.740
## Deviance 3112.257 3478.689 3101.773 2758.329
## AIC 11684.782 12229.967 11670.255 11113.480
## BIC 11704.272 12249.456 11696.241 11197.936
## N 4898 4898 4898 4898
## ================================================================================
# fml1 <- as.formula(paste("quality", "~", paste(INDEPENDENT, collapse=' + ')))
#m1 <- lm(fml1, ww_lm)
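The nested models above are built with `update()`, where `.` stands for the existing formula, `-` removes a term and `+` adds one. A minimal sketch on R's built-in `mtcars` data (a stand-in chosen only because it ships with R) shows the resulting formulas; the names `fit1` to `fit3` are hypothetical.

```r
# Start with a single predictor.
fit1 <- lm(mpg ~ wt, data = mtcars)

# Swap the predictor: remove wt, add hp.
fit2 <- update(fit1, ~ . - wt + hp)

# Add wt back alongside hp.
fit3 <- update(fit2, ~ . + wt)

# Inspect the formulas that update() produced.
formula(fit2)
formula(fit3)
```

This mirrors how m2 ends up as `quality ~ density` and m3 as `quality ~ density + alcohol` in the calls listed in the model table.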
Linear regression analysis has led to the following observations,
Conclusion:
Linear modeling is not sufficient to predict white wine quality.
The above correlation plot omits results that are not statistically significant at the 0.05 level (\(p \leq 0.05\)).
The graph above shows the following significant correlations:
- Strong positive correlation between Density and Residual Sugar
- Strong negative correlation between Alcohol and Density
- Medium positive correlation between:
  - Quality and Alcohol
  - Density and Total Sulfur Dioxide
- Medium negative correlation between:
  - Quality and Density
  - Alcohol and each of Residual Sugar, Total Sulfur Dioxide, Chlorides, and Free Sulfur Dioxide
Density distribution of Alcohol & Density, color-coded by wine grade
The above plot suggests that great quality wines have the highest median Alcohol level and the lowest median Density.
The above plot suggests that
For the same Alcohol content, great quality wines tend to have a lower Residual Sugar value & range.
High levels of chlorides exist only in bad quality wines.
For the same Volatile Acidity level, great quality wines tend to have higher Alcohol content.
Samples with high Volatile Acidity or high Residual Sugar tend to have lower than average Alcohol levels.
The white wine dataset used in this analysis contains multiple physical & chemical attributes together with a rating of wine quality. Across the 4898 samples, linear regression failed to predict white wine quality well: even the model with all eleven input variables explains only about 28% of the variance in quality. Yet we noticed a moderate positive correlation between white wine quality and alcohol (r = 0.436), where about 19% of the quality variance can be explained by changes in alcohol content.
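The share of variance explained by alcohol follows from the correlation coefficient: in a regression with a single predictor, R-squared equals the squared Pearson correlation. A short sketch, using simulated data for the general identity (an assumption) and the reported r = 0.436 for the arithmetic:

```r
# General identity, checked on simulated data.
set.seed(7)
x <- rnorm(100)
y <- 2 * x + rnorm(100)
fit <- lm(y ~ x)
all.equal(cor(x, y)^2, summary(fit)$r.squared)

# Applied to the reported correlation between quality and alcohol:
# 0.436^2 = 0.190096, which matches the R-squared of model m1 (0.190).
round(0.436^2, 3)
```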